For my final project, I selected the Los Angeles crime dataset, which starts in 2020, to examine the local areas with the highest crime, the most common crimes, and victim demographics. The original dataset has over 500,000 observations and 29 variables that describe the date, time, crime, weapon, victim, and location. In this presentation, I display the time of year with the highest crime, the most common crimes in the top areas, and a breakdown of victim demographics.
The dataset came as a .csv file. I renamed the columns and dropped 10 columns that were not relevant to the analysis. Then I reformatted the date and time variables to extract the month and year for use as variables. Since the year is not over, I removed 2022 from certain time-related EDA and analysis.
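The date handling described above can be sketched roughly as follows. This is a minimal illustration in standard-library Python, not the code used for the project: the column name `date_occurred`, the date format string, and the sample rows are all assumptions made for the example.

```python
from datetime import datetime

# Toy rows for illustration only. The column names and the date format
# are assumptions, not the project's actual renamed columns.
rows = [
    {"date_occurred": "01/08/2020 12:00:00 AM", "crime": "VEHICLE - STOLEN"},
    {"date_occurred": "07/15/2021 12:00:00 AM", "crime": "BURGLARY FROM VEHICLE"},
    {"date_occurred": "03/02/2022 12:00:00 AM", "crime": "VEHICLE - STOLEN"},
]


def add_month_year(row):
    """Return a copy of the row with integer month and year fields added."""
    parsed = datetime.strptime(row["date_occurred"], "%m/%d/%Y %I:%M:%S %p")
    return {**row, "month": parsed.month, "year": parsed.year}


with_dates = [add_month_year(r) for r in rows]

# Keep the full data for other analyses, and drop the incomplete year
# only in the copy used for time-based summaries.
complete_years = [r for r in with_dates if r["year"] != 2022]
```

Keeping `with_dates` separate from `complete_years` mirrors the idea that 2022 is removed only for the time-related parts of the analysis.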
As for coordinates, 2,266 rows had coordinates of (0,0) for privacy reasons. I decided to drop these rows since the dataset is already large and removing them would not affect the analysis.
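As a rough sketch of that filter, assuming latitude and longitude are stored in hypothetical columns named `lat` and `lon` (the sample values below are made up):

```python
# Toy rows for illustration only; "lat" and "lon" are assumed column names.
rows = [
    {"lat": 34.0141, "lon": -118.2978},
    {"lat": 0.0, "lon": 0.0},  # masked location
    {"lat": 34.0459, "lon": -118.2545},
]


def has_location(row):
    """True unless both coordinates are exactly zero."""
    return not (row["lat"] == 0 and row["lon"] == 0)


located = [r for r in rows if has_location(r)]
dropped = len(rows) - len(located)  # number of rows removed by this step
```

Tracking `dropped` makes it easy to confirm how many rows a cleaning step removed.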
Victim Age had negative values, 0s (meaning not available), and an oddly high value of 120. For simplicity, I filtered age to the range 0-100, which excludes the negative values and the 120 outlier. This step discarded 25 rows. The variables Weapon, Victim Sex, and Victim Ethnicity had thousands of empty cells. For Victim Sex and Victim Ethnicity, I replaced the empty cells with NA instead of dropping the rows, on the assumption that empty cells meant the information was not given. As for Weapon, I set the empty cells to “NONE”, meaning that no weapon was used.
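These rules can be sketched as below. The column names (`vict_age`, `vict_sex`, `vict_ethnicity`, `weapon`) and the sample values are assumptions for illustration, and Python's `None` stands in for NA.

```python
# Toy rows for illustration only; all column names are assumed.
rows = [
    {"vict_age": 25, "vict_sex": "F", "vict_ethnicity": "H", "weapon": ""},
    {"vict_age": -1, "vict_sex": "M", "vict_ethnicity": "W", "weapon": ""},
    {"vict_age": 120, "vict_sex": "M", "vict_ethnicity": "B", "weapon": ""},
    {"vict_age": 0, "vict_sex": "", "vict_ethnicity": "", "weapon": "HAND GUN"},
]


def clean(row):
    """Apply the missing-value rules to a copy of one row."""
    cleaned = dict(row)
    for col in ("vict_sex", "vict_ethnicity"):
        if not cleaned[col].strip():
            cleaned[col] = None  # stand-in for NA: information not given
    if not cleaned["weapon"].strip():
        cleaned["weapon"] = "NONE"  # empty weapon treated as no weapon used
    return cleaned


# Ages 0 through 100 are kept (0 means "not available"); negative values
# and the 120 outlier fall outside the range and are dropped.
kept = [clean(r) for r in rows if 0 <= r["vict_age"] <= 100]
```

Note that the age filter keeps the 0 values, so any later summary by age would need to decide separately how to treat them.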
Data cleaning removed 2,291 rows in total (2,266 with (0,0) coordinates and 25 with out-of-range ages). The crimedat dataset now has 584,004 observations and 21 variables. For further analysis, I subsetted the data to the top 5 areas with the highest crime because of how large the dataset is.
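One way to pick the top 5 areas and subset to them is sketched below. The counts are made up for the example, the `area` column name is an assumption, and the "Area C" through "Area F" labels are placeholders rather than real results.

```python
from collections import Counter

# Toy data for illustration only: counts are invented and "Area C" to
# "Area F" are placeholder names. The "area" column name is assumed.
areas = (
    ["77th Street"] * 6
    + ["Central"] * 5
    + ["Area C"] * 4
    + ["Area D"] * 3
    + ["Area E"] * 2
    + ["Area F"] * 1
)
rows = [{"area": a} for a in areas]

# Count reports per area and keep the five most frequent.
counts = Counter(r["area"] for r in rows)
top5 = [area for area, _ in counts.most_common(5)]

# Subset the data to rows from those areas only.
top5_set = set(top5)
subset = [r for r in rows if r["area"] in top5_set]
```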
As mentioned before, 2022 is excluded from this analysis since the year is not over. Focusing on the top 5 areas with the highest crime, I saw that from January to June there are a little under 10,000 reports. However, in the warmer months, reports rise above 10,000 and then drop again in November and December.
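A monthly tally of this kind could be sketched as follows, assuming each row already carries hypothetical integer `year` and `month` fields. The rows are toy data, so the counts do not reflect the real figures.

```python
import calendar
from collections import Counter

# Toy rows for illustration only; "year" and "month" are assumed fields.
rows = [
    {"year": 2020, "month": 1},
    {"year": 2020, "month": 7},
    {"year": 2021, "month": 7},
    {"year": 2021, "month": 12},
    {"year": 2022, "month": 7},  # incomplete year, excluded below
]

# Count reports per calendar month across the complete years only.
monthly = Counter(r["month"] for r in rows if r["year"] in (2020, 2021))

# Print one line per month; months with no reports show a count of 0.
for m in range(1, 13):
    print(calendar.month_abbr[m], monthly[m])
```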
Each area has two dots, one for each year (2020 and 2021). Across the plot, the highest points (blue and orange) are from the 77th Street and Central areas. There are over 1,500 reports of burglary from vehicle and vehicle stolen.
From the histogram, there is a noticeable spike in women in their 20s reporting crimes compared to men. However, after their 30s, more men than women report crimes. When examining victim ethnicity, there is a high number of reports by Hispanic residents in their 20s through 40s. White victims are the second-largest group, followed by Black victims.
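The counts behind an age histogram split by sex can be sketched like this. The column names and sample rows are assumptions, and skipping age 0 (not available) and missing sex values is a choice made for this example rather than something stated above.

```python
from collections import Counter

# Toy rows for illustration only; column names are assumed.
rows = [
    {"vict_age": 24, "vict_sex": "F"},
    {"vict_age": 27, "vict_sex": "F"},
    {"vict_age": 22, "vict_sex": "M"},
    {"vict_age": 45, "vict_sex": "M"},
    {"vict_age": 41, "vict_sex": "M"},
    {"vict_age": 48, "vict_sex": "F"},
    {"vict_age": 0, "vict_sex": "M"},    # age not available
    {"vict_age": 33, "vict_sex": None},  # sex not given
]


def decade(age):
    """Map an age to the start of its decade, e.g. 27 -> 20."""
    return (age // 10) * 10


# Count victims per (decade, sex) bin, skipping unusable rows.
by_bin = Counter(
    (decade(r["vict_age"]), r["vict_sex"])
    for r in rows
    if r["vict_age"] > 0 and r["vict_sex"] in ("F", "M")
)
```

The same pattern works for ethnicity by swapping the second element of the key.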
From the EDA displayed, there are more crime reports in the warmer months than in the colder months. It would be interesting to look into what factors could be contributing to that. In the areas with the highest crime, theft from vehicle and vehicle theft are common, particularly in 77th Street and Central. As for victim demographics, younger women, older men, and Hispanic residents report more crime.
Copyright © 2020, Misha Khan.